System, method, and apparatus for dynamically replicating data for heterogeneous Hadoop

ABSTRACT

A system for dynamically replicating data for a heterogeneous Hadoop is provided. The system includes a name node having a replication manager. The replication manager calculates a probability that a map task is allocated to a map task slot of a data node that stores an input data block of the map task out of a plurality of map task slots of data nodes by using a number of the map task slots in the Hadoop clusters comprised of heterogeneous clusters, and dynamically replicates data based on the probability.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to Korean Patent Application No. 10-2015-0175832, filed on Dec. 10, 2015, the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND

The present disclosure relates to a system, a method, and an apparatus for replicating data in a Hadoop, and more particularly, to a system, a method, and an apparatus for dynamically replicating data for a heterogeneous Hadoop, which provide a dynamic data replication method for dynamically replicating data based on a probability that a map task will be allocated to a map task slot of an optimal data node by a data replication method for a Hadoop having heterogeneous clusters, and a dynamic data eviction method for evicting data based on a data access frequency.

In recent years, the biggest topic in the information technology (IT) industry has been cloud computing.

The amount of data is increasing exponentially due to the spread of mobile devices and tablets. Accordingly, the term “big data” has come into use, and the importance of cloud computing is increasing day by day.

Cloud computing is a computing environment in which IT-related services such as the storage of data, the use of network content, and the like can be simultaneously used through a server on the Internet.

As cloud computing has come into the spotlight, interest in Hadoop and MapReduce has also naturally increased.

Hadoop is a Java-based software framework and an open-source distributed computing platform that supports distributed application programs running on a large computer cluster that can process a massive amount of data.

Hadoop is an open source framework that is composed of MapReduce, which is a distributed processing programming model, and a Hadoop Distributed File System (HDFS), which is used throughout Hadoop.

MapReduce is a framework that distributes a massive amount of data to several nodes to process the data. The HDFS is the distributed file system that Hadoop uses when it processes a massive amount of data.

Conventionally, a Hadoop delivers a task to a data node that stores data in order to minimize network congestion and increase throughput of the entire system. The best performance is obtained when a task is performed by a data node that stores the input data block.

However, when all data nodes that store the input data of a task are already performing tasks, the task is performed only after the data is copied to another data node, which causes a delay and thus reduces the performance of MapReduce.

Conventionally, a Hadoop stores three copies of each piece of data stored in an HDFS in corresponding data nodes, and does not have a data replication method for dynamically adjusting the number of copies according to real-time data access requests.

The access count differs for each piece of data stored in the HDFS, and it is inefficient to keep the number of copies of frequently accessed data equal to the number of copies of infrequently accessed data.

This is because a Hadoop delivers a task to a data node at which input data is placed in order to minimize the usage of network bandwidth and maximize throughput of MapReduce jobs.

The best performance is obtained when a task is performed by a data node that stores the input data. However, when data has an access request count greater than the number of its copies, there is an increasing probability that a task will be allocated to a data node that does not store the data.

In that case, the task is performed after the input data block is additionally copied to a currently available data node. This may cause a delay, thereby reducing the performance of MapReduce jobs.

SUMMARY OF THE DISCLOSURE

The present disclosure is directed to providing a system, a method, and an apparatus for dynamically replicating data for a heterogeneous Hadoop, which provide a dynamic data replication method for dynamically replicating data based on a probability that a map task will be allocated to a map task slot of an optimal data node by a data replication method for a Hadoop composed of heterogeneous clusters, and a dynamic data eviction method for evicting data based on a data access frequency.

According to one aspect of the present disclosure, a system for dynamically replicating data for a heterogeneous Hadoop is provided. The system includes a name node including a replication manager, wherein the replication manager calculates a probability that a map task is allocated to a map task slot of a data node that stores an input data block of the map task out of a plurality of map task slots of data nodes by using a number of the map task slots in the Hadoop clusters comprising heterogeneous clusters, and dynamically replicates data based on the probability.

The replication manager includes a dynamic data replica creator classifying a first map task slot of Data-Local-Map-Task-Slot of the data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task, and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the following equation:

P(DLMT)=P(Data Local Slots|Empty Slots)   [Equation]

where P(DLMT) denotes the probability to select Data Local Slots, each of which is the map task slot of the data node that stores the input data of the map task in the data node, out of Empty Slots, which are the map task slots that do not perform the task.

The replication manager includes a dynamic data replica creator classifying a first map task slot of Data-Local-Map-Task-Slot of a data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task, and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the Equation 1 below:

$\begin{aligned} P(\text{DLMT}) &= \frac{P(\text{Empty Slots} \mid \text{Data Local Slots}) \times P(\text{Data Local Slots})}{P(\text{Empty Slots})} \\ &= \frac{\dfrac{n(\text{Empty Slots})}{n(\text{Data Local Slots})} \times \dfrac{n(\text{Data Local Slots})}{n(\text{Total Slots})}}{\dfrac{n(\text{Total Empty Slots})}{n(\text{Total Slots})}} \end{aligned}$   [Equation 1]

where P(Empty Slots|Data Local Slots) denotes a probability to select Empty Slots, each of which is the map task slot that does not perform the task, out of Data Local Slots, which are map task slots of data nodes that store the input data of the map task in the plurality of data nodes, wherein:

P(Empty Slots|Data Local Slots) is obtained by the Equation 2 below:

$P(\text{Empty Slots} \mid \text{Data Local Slots}) = \frac{n(\text{Empty Slots})}{n(\text{Data Local Slots})}$   [Equation 2]

where n(Data Local Slots) denotes the number of map task slots of the data nodes that store the input data of the map task out of the plurality of the data nodes, and n(Empty Slots) denotes the number of task slots that do not perform tasks in the data node;

P(Data Local Slots) denotes a probability to select Data Local Slots out of a total number of map task slots of the data nodes, and P(Data Local Slots) is obtained by the Equation 3 below:

$P(\text{Data Local Slots}) = \frac{n(\text{Data Local Slots})}{n(\text{Total Slots})}$;   [Equation 3]

P(Empty Slots) denotes a probability to select the map task slot that does not perform the task out of the plurality of task slots of the data nodes, and P(Empty Slots) is obtained by the Equation 4 below:

$P(\text{Empty Slots}) = \frac{n(\text{Total Empty Slots})}{n(\text{Total Slots})}$;   [Equation 4]

Total Slots is the total number of map task slots of the data nodes, and n(Total Slots) is obtained by the Equation 5 below:

n(Total Slots)=n(Data Local Slots)+n(Rack Local Slots)   [Equation 5]

where n(Data Local Slots) denotes the number of map task slots of the data nodes that store the input data of the map task out of the plurality of the data nodes, and n(Rack Local Slots) is the number of map task slots of data nodes that do not store the input data of the map task out of the plurality of the data nodes; and

Empty Slots are slots that do not perform tasks in a data node, and n(Total Empty Slots) is obtained by the Equation 6 below:

n(Total Empty Slots)=n(Data Local Empty Slots)+n(Rack Local Empty Slots)   [Equation 6]

where n(Data Local Empty Slots) denotes a number of empty slots that do not perform tasks among the map task slots of the data nodes that store the input data of the map task in the plurality of the data nodes, and n(Rack Local Empty Slots) denotes a number of empty slots that do not perform tasks among the map task slots of the data nodes that do not store the input data of the map task in the plurality of the data nodes.

The replication manager includes a Data Local Job Probability (DLJP) calculator calculating a probability P(DLJ) that a Data-Local-Map-Task (DLMT), which is a map task allocated to the first map task slot of Data-Local-Map-Task-Slot out of the plurality of map tasks, occurs in a job having an i-number of map tasks, wherein the probability P(DLJ) is calculated by using the Equation 7 below:

$\begin{aligned} P(\text{DLJ}) &= \frac{P(\text{DLMT}_{[1]} + \text{DLMT}_{[2]} + \text{DLMT}_{[3]} + \ldots + \text{DLMT}_{[i]})}{n(\text{Total Map Task})} \\ &= \frac{\sum_{i=0}^{n} P(\text{DLMT}_{[i]})}{n(\text{Total Map Task})} \end{aligned}$   [Equation 7]

where n(Total Map Task) denotes a total number of the map tasks of the plurality of the data nodes.

The dynamic data replica creator replicates data in real time using the probability P(DLJ), and schedules and allocates the map task to the second map task slot of Rack-Local-Map-Task-Slot when all of the first map task slots of Data-Local-Map-Task-Slots of the map task are performing tasks.

The replication manager includes a replica eviction selector: measuring a frequency of access to Data[i] using the Equation 8 below; measuring a frequency of access to all data of a Hadoop Distributed File System (HDFS) using the Equation 9 below; and evicting a data replica when the frequency of access to the Data[i] is lower than the frequency of access to all data of the HDFS, wherein the Equation 8 is

$\text{Data}[i].\text{AccessFrequency} = \frac{\text{Data}[i].\text{AccessCount}}{\text{Data}[i].\text{StoredTime}}$

where Data[i].AccessFrequency denotes the frequency of access to the Data[i], Data[i].StoredTime denotes a storage time of the Data[i], Data[i].AccessCount denotes a number of accesses to the Data[i], and i is a number of tasks; and

Equation 9 is

$\text{Data Access Frequency of HDFS} = \frac{\text{Total Data Access Count}}{\text{HDFS Running Time}}$

where Data Access Frequency of HDFS is the frequency of access to all data of the HDFS, HDFS Running Time is an operating time of the HDFS, and Total Data Access Count is a number of accesses to all data of the HDFS.
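The eviction rule of Equations 8 and 9 can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the function names, the dictionary layout, and the use of a common time unit for the storage time and the HDFS running time are assumptions made only for illustration.

```python
def access_frequency(access_count, elapsed_time):
    """Equation 8 / Equation 9: number of accesses divided by elapsed time."""
    if elapsed_time <= 0:
        return 0.0  # assumption: no elapsed time means no measurable frequency
    return access_count / elapsed_time


def select_evictions(data_items, total_access_count, hdfs_running_time):
    """Return names of data whose access frequency is below the HDFS-wide one.

    data_items maps a hypothetical data name to (access_count, stored_time).
    """
    hdfs_frequency = access_frequency(total_access_count, hdfs_running_time)
    return [
        name
        for name, (count, stored_time) in data_items.items()
        if access_frequency(count, stored_time) < hdfs_frequency
    ]


items = {"a": (100, 10.0), "b": (5, 10.0)}
print(select_evictions(items, total_access_count=105, hdfs_running_time=20.0))
```

With these illustrative numbers the HDFS-wide frequency is 5.25 accesses per time unit, so only "b" (0.5) falls below it and is selected for replica eviction, while "a" (10.0) is kept.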

According to another aspect of the present disclosure, a method of dynamically replicating data for heterogeneous Hadoop is provided. The method includes calculating a probability that a map task is allocated to a map task slot of a data node that stores an input data block of the map task out of a plurality of map task slots of data nodes by using a number of the map task slots in Hadoop clusters comprising heterogeneous clusters, and dynamically replicating data based on the probability.

The step of the calculating of the probability further includes classifying a first map task slot of Data-Local-Map-Task-Slot of the data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task, and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the Equation 1:

P(DLMT)=P(Data Local Slots|Empty Slots)   [Equation 1]

where P(DLMT) denotes the probability to select Data Local Slots, each of which is the map task slot of the data node that stores the input data of the map task in the data node, out of Empty Slots, which are the map task slots that do not perform the task.

The step of the calculating of the probability using Bayesian theory further includes: calculating a probability P(DLJ) that a Data-Local-Map-Task (DLMT), which is a map task allocated to the first map task slot of Data-Local-Map-Task-Slot out of the plurality of map tasks, occurs in a job having an i-number of map tasks, wherein the probability P(DLJ) is calculated by using the Equation 2 below:

$\begin{aligned} P(\text{DLJ}) &= \frac{P(\text{DLMT}_{[1]} + \text{DLMT}_{[2]} + \text{DLMT}_{[3]} + \ldots + \text{DLMT}_{[i]})}{n(\text{Total Map Task})} \\ &= \frac{\sum_{i=0}^{n} P(\text{DLMT}_{[i]})}{n(\text{Total Map Task})} \end{aligned}$   [Equation 2]

where n(Total Map Task) denotes a total number of the map tasks of the plurality of the data nodes.

According to another aspect of the present disclosure, an apparatus for dynamically replicating data for a heterogeneous Hadoop is provided. The apparatus includes a name node including a replication manager. The replication manager calculates a probability that a map task is allocated to a map task slot of a data node that stores an input data block of the map task out of a plurality of map task slots of data nodes by using a number of the map task slots in the Hadoop clusters comprising heterogeneous clusters, and dynamically replicates data based on the probability.

The replication manager includes a dynamic data replica creator classifying a first map task slot of Data-Local-Map-Task-Slot of the data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task, and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the following equation:

P(DLMT)=P(Data Local Slots|Empty Slots)   [Equation]

where P(DLMT) denotes the probability to select Data Local Slots, each of which is the map task slot of the data node that stores the input data of the map task in the data node, out of Empty Slots, which are the map task slots that do not perform the task.

The replication manager includes a dynamic data replica creator classifying a first map task slot of Data-Local-Map-Task-Slot of a data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task, and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the Equation 1 below:

$\begin{aligned} P(\text{DLMT}) &= \frac{P(\text{Empty Slots} \mid \text{Data Local Slots}) \times P(\text{Data Local Slots})}{P(\text{Empty Slots})} \\ &= \frac{\dfrac{n(\text{Empty Slots})}{n(\text{Data Local Slots})} \times \dfrac{n(\text{Data Local Slots})}{n(\text{Total Slots})}}{\dfrac{n(\text{Total Empty Slots})}{n(\text{Total Slots})}} \end{aligned}$   [Equation 1]

where P(Empty Slots|Data Local Slots) denotes a probability to select Empty Slots, each of which is the map task slot that does not perform the task, out of Data Local Slots, which are map task slots of data nodes that store the input data of the map task in the plurality of data nodes, wherein:

P(Empty Slots|Data Local Slots) is obtained by the Equation 2 below:

$P(\text{Empty Slots} \mid \text{Data Local Slots}) = \frac{n(\text{Empty Slots})}{n(\text{Data Local Slots})}$   [Equation 2]

where n(Data Local Slots) denotes the number of map task slots of the data nodes that store the input data of the map task out of the plurality of the data nodes, and n(Empty Slots) denotes the number of task slots that do not perform tasks in the data node;

P(Data Local Slots) denotes a probability to select Data Local Slots out of a total number of map task slots of the data nodes, and P(Data Local Slots) is obtained by the Equation 3 below:

$P(\text{Data Local Slots}) = \frac{n(\text{Data Local Slots})}{n(\text{Total Slots})}$;   [Equation 3]

P(Empty Slots) denotes a probability to select the map task slot that does not perform the task out of the plurality of task slots of the data nodes, and P(Empty Slots) is obtained by the Equation 4 below:

$P(\text{Empty Slots}) = \frac{n(\text{Total Empty Slots})}{n(\text{Total Slots})}$;   [Equation 4]

Total Slots is the total number of map task slots of the data nodes, and n(Total Slots) is obtained by the Equation 5 below:

n(Total Slots)=n(Data Local Slots)+n(Rack Local Slots)   [Equation 5]

where n(Data Local Slots) denotes the number of map task slots of the data nodes that store the input data of the map task out of the plurality of the data nodes, and n(Rack Local Slots) is the number of map task slots of data nodes that do not store the input data of the map task out of the plurality of the data nodes; and

Empty Slots are slots that do not perform tasks in a data node, and n(Total Empty Slots) is obtained by the Equation 6 below:

n(Total Empty Slots)=n(Data Local Empty Slots)+n(Rack Local Empty Slots)   [Equation 6]

where n(Data Local Empty Slots) denotes a number of empty slots that do not perform tasks among the map task slots of the data nodes that store the input data of the map task in the plurality of the data nodes, and n(Rack Local Empty Slots) denotes a number of empty slots that do not perform tasks among the map task slots of the data nodes that do not store the input data of the map task in the plurality of the data nodes.

The replication manager comprises a Data Local Job Probability (DLJP) calculator calculating a probability P(DLJ) that a Data-Local-Map-Task (DLMT), which is a map task allocated to the first map task slot of Data-Local-Map-Task-Slot out of the plurality of map tasks, occurs in a job having an i-number of map tasks, wherein the probability P(DLJ) is calculated by using the Equation 7 below:

$\begin{aligned} P(\text{DLJ}) &= \frac{P(\text{DLMT}_{[1]} + \text{DLMT}_{[2]} + \text{DLMT}_{[3]} + \ldots + \text{DLMT}_{[i]})}{n(\text{Total Map Task})} \\ &= \frac{\sum_{i=0}^{n} P(\text{DLMT}_{[i]})}{n(\text{Total Map Task})} \end{aligned}$   [Equation 7]

where n(Total Map Task) denotes a total number of the map tasks of the plurality of the data nodes.

The dynamic data replica creator replicates data in real time using the probability P(DLJ), and schedules and allocates the map task to the second map task slot of Rack-Local-Map-Task-Slot when all of the first map task slots of Data-Local-Map-Task-Slots of the map task are performing tasks.

The replication manager comprises a replica eviction selector: measuring a frequency of access to Data[i] using the Equation 8 below; measuring a frequency of access to all data of a Hadoop Distributed File System (HDFS) using the Equation 9 below; and evicting a data replica when the frequency of access to the Data[i] is lower than the frequency of access to all data of the HDFS, wherein the Equation 8 is

$\text{Data}[i].\text{AccessFrequency} = \frac{\text{Data}[i].\text{AccessCount}}{\text{Data}[i].\text{StoredTime}}$

where Data[i].AccessFrequency denotes the frequency of access to the Data[i], Data[i].StoredTime denotes a storage time of the Data[i], Data[i].AccessCount denotes a number of accesses to the Data[i], and i is a number of tasks; and

Equation 9 is

$\text{Data Access Frequency of HDFS} = \frac{\text{Total Data Access Count}}{\text{HDFS Running Time}}$

where Data Access Frequency of HDFS is the frequency of access to all data of the HDFS, HDFS Running Time is an operating time of the HDFS, and Total Data Access Count is a number of accesses to all data of the HDFS.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects, features and advantages of the present disclosure will become more apparent to those of ordinary skill in the art by describing exemplary embodiments thereof in detail with reference to the accompanying drawings, in which:

FIG. 1 is a diagram showing a configuration of a system for dynamically replicating data for heterogeneous Hadoop according to one embodiment of the present disclosure;

FIG. 2 is a diagram showing a data flow of a MapReduce job according to the embodiment of the present disclosure; and

FIG. 3 is a diagram showing an aspect of classifying Data-Local-Map-Task-Slots and Rack-Local-Map-Task-Slots on the basis of Map Task1 receiving Block1 as input data according to the embodiment of the present disclosure.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

Hereinafter, when one part is referred to as “comprising” (or “including” or “having”) other elements, it should be understood that the one part can comprise (or include or have) only those elements, or other elements as well as those elements unless specifically described otherwise.

A conventional Hadoop cluster structure has a single name node and multiple data nodes. The name node includes a job tracker, a name node, and a secondary name node.

The name node manages a namespace in a file system, and the namespace has a directory, a file name, a file block, location information of data, and the like.

The data node is configured with block units that actually store files.

MapReduce is a framework for distributing a large amount of data to several nodes to process the data. A MapReduce job is a unit of work that a user wants to be performed.

The map task receives a data block stored in a Hadoop Distributed File System (HDFS).

Since the result data obtained by performing the map task is an intermediate result that is used only as input data of a Reduce task, the result data is stored in a local disk of the data node that performs the task rather than in the HDFS.

Also, a map task achieves its best performance when the task is performed by the data node that stores the input data.

FIG. 1 is a diagram showing a configuration of a system for dynamically replicating data for heterogeneous Hadoop according to one embodiment of the present disclosure, FIG. 2 is a diagram showing a data flow of a MapReduce job according to the embodiment of the present disclosure, and FIG. 3 is a diagram showing an aspect of classifying Data-Local-Map-Task-Slots and Rack-Local-Map-Task-Slots on the basis of Map Task1 receiving Block1 as input data according to the embodiment of the present disclosure.

In a Hadoop having heterogeneous clusters, there are differences in performance among data nodes. Thus, the number of map tasks that may be simultaneously executed differs from node to node.

Accordingly, the present disclosure sets the number of map task slots of each data node on the basis of the performance thereof, unlike a conventional data replication method that does not consider performance.

The number of map slots refers to the number of map tasks that can be simultaneously executed by the data node.

A reduce task receives the result data of a map task as input data, and the result data of the map task is stored not in the HDFS but in a local file system. Thus, it is difficult to expect a performance enhancement from data replication.

In contrast, the input data of the map task consists of data blocks stored in the HDFS, so the number of data replicas affects the performance of the map task.

Accordingly, the present disclosure uses map slots, rather than reduce slots, as a criterion for the performance of heterogeneous clusters.

A data flow of MapReduce jobs is shown in FIG. 2.

A system for dynamically replicating data for heterogeneous Hadoop according to one embodiment of the present disclosure includes a name node 100 including a job tracker 110 and a replication manager 120.

The replication manager 120 includes a dynamic data replica creator 122, a data local job probability (DLJP) calculator 124, and a replica eviction selector 126.

The dynamic data replica creator 122 classifies Data-Local-Map-Task-Slots 132, each of which is a map task slot of a data node 131 that stores an input data block of a map task 130, and Rack-Local-Map-Task-Slots 134, each of which is a map task slot of the data node 131 that does not store the input data block of the map task 130.

A method of classifying the Data-Local-Map-Task-Slots 132 and the Rack-Local-Map-Task-Slots 134 is shown in FIG. 3.

The map task 130 allocated to the Data-Local-Map-Task-Slots 132 can be hereinafter referred to as DLMT, and the map task 130 allocated to the Rack-Local-Map-Task-Slots 134 can be hereinafter referred to as RLMT.

As shown in FIG. 3, the dynamic data replica creator 122 classifies the Data-Local-Map-Task-Slots 132 and the Rack-Local-Map-Task-Slots 134 on the basis of Map Task1 receiving Block1 130 as input data. According to the same principle, the dynamic data replica creator 122 classifies the Data-Local-Map-Task-Slots 132 and the Rack-Local-Map-Task-Slots 134 with respect to other blocks.
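The classification described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `DataNode` class, its fields, and the node and block names are hypothetical and do not appear in the disclosure. The sketch counts the slots of nodes that store a given block as data-local and the slots of all other nodes as rack-local.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    map_slots: int  # number of map task slots, set from the node's performance
    blocks: set = field(default_factory=set)  # block ids stored on this node


def classify_slots(nodes, block_id):
    """Return (data_local_slots, rack_local_slots) for one input block."""
    data_local = sum(n.map_slots for n in nodes if block_id in n.blocks)
    rack_local = sum(n.map_slots for n in nodes if block_id not in n.blocks)
    return data_local, rack_local


nodes = [
    DataNode("dn1", 4, {"Block1", "Block2"}),
    DataNode("dn2", 2, {"Block1"}),
    DataNode("dn3", 3, {"Block3"}),
]
print(classify_slots(nodes, "Block1"))  # (6, 3)
```

For Block1, the six slots of dn1 and dn2 are data-local and the three slots of dn3 are rack-local; repeating the call with another block id classifies the slots with respect to that block.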

The dynamic data replica creator 122 measures a probability P(DLMT) that the map task 130 will be allocated to Data-Local-Map-Task-Slots that are not performing tasks among a total number of map task slots of data nodes using Equation (1) below, to which Bayesian theory is applied.

The Bayesian theory is a method of measuring prior probabilities of events that may be causes of a specific event after the specific event occurs and deducing posterior probabilities of events that may become causes later using the measured prior probabilities.

$\begin{aligned} P(\text{DLMT}) &= P(\text{Data Local Slots} \mid \text{Empty Slots}) \\ &= \frac{P(\text{Empty Slots} \mid \text{Data Local Slots}) \times P(\text{Data Local Slots})}{P(\text{Empty Slots})} \\ &= \frac{\dfrac{n(\text{Empty Slots})}{n(\text{Data Local Slots})} \times \dfrac{n(\text{Data Local Slots})}{n(\text{Total Slots})}}{\dfrac{n(\text{Total Empty Slots})}{n(\text{Total Slots})}}. \end{aligned}$   [Equation (1)]

Total Slots denotes the total number of map task slots of data nodes.

n(Total Slots) is obtained by using Equation (2) below:

n(Total Slots)=n(Data Local Slots)+n(Rack Local Slots)   [Equation (2)]

where n(Data Local Slots) denotes the number of map task slots of data nodes that store input data of a map task out of all of the plurality of data nodes, and n(Rack Local Slots) denotes the number of map task slots of data nodes that do not store the input data of the map task out of all of the plurality of data nodes.

Empty Slots denotes slots that do not perform tasks in data nodes.

n(Total Empty Slots) is obtained by using Equation (3) below:

n(Total Empty Slots)=n(Data Local Empty Slots)+n(Rack Local Empty Slots)   [Equation (3)]

where n(Data Local Empty Slots) denotes the number of empty slots that do not perform tasks among the map task slots of the data nodes that store the input data of the map task in all of the plurality of the data nodes, and n(Rack Local Empty Slots) denotes the number of empty slots that do not perform tasks among the map task slots of the data nodes that do not store the input data of the map task in all of the plurality of the data nodes.

P(Empty Slots|Data Local Slots) denotes a probability to select Empty Slots, each of which is a map task slot that does not perform the task, out of Data Local Slots.

P(Empty Slots|Data Local Slots) is obtained by using Equation (4) below:

$\begin{matrix}{{P( {{Empty}\mspace{14mu} {Slots}} \middle| {{Data}\mspace{14mu} {Local}\mspace{14mu} {Slots}} )} = {\frac{n( {{Empty}\mspace{14mu} {Slots}} )}{n( {{Data}\mspace{14mu} {Local}\mspace{14mu} {Slots}} )}.}} & \lbrack {{Equation}\mspace{14mu} (4)} \rbrack\end{matrix}$

P(Data Local Slots) denotes a probability to select Data Local Slots outof a total number of map task slots of the data nodes.

P(Data Local Slots) is obtained by using Equation (5) below:

$\begin{matrix}{{P( {{Data}\mspace{14mu} {Local}\mspace{14mu} {Slots}} )} = {\frac{n( {{Data}\mspace{14mu} {Local}\mspace{14mu} {Slots}} )}{n( {{Total}\mspace{14mu} {Slots}} )}.}} & \lbrack {{Equation}\mspace{14mu} (5)} \rbrack\end{matrix}$

P(Empty Slots) denotes the probability of selecting a map task slot that does not perform a task out of all of the plurality of slots of the data nodes.

P(Empty Slots) is obtained by using Equation (6) below:

$\begin{matrix}{{P( {{Empty}\mspace{14mu} {Slots}} )} = {\frac{n( {{Total}\mspace{14mu} {Empty}\mspace{14mu} {Slots}} )}{n( {{Total}\mspace{14mu} {Slots}} )}.}} & \lbrack {{Equation}\mspace{14mu} (6)} \rbrack\end{matrix}$

The DLJP calculator 124 calculates a probability P(DLJ) that, in a job composed of an i-number of map tasks, DLMT will occur among all of the plurality of map tasks, using Equation (7).

P(DLJ) is obtained by using Equation (7). Here, i is the number of tasks.

$P(\text{DLJ}) = \frac{P(\text{DLMT}_{[1]} + \text{DLMT}_{[2]} + \text{DLMT}_{[3]} + \ldots + \text{DLMT}_{[i]})}{n(\text{Total Map Task})} = \frac{\sum_{i=0}^{n} P(\text{DLMT}_{[i]})}{n(\text{Total Map Task})}$   [Equation (7)]
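As a purely illustrative worked example (the numbers are made up and do not come from the disclosure), assume the numerator of Equation (7) is read as the sum of the per-task probabilities. A job with three map tasks whose data-local probabilities are 0.5, 0.25, and 0.75 would then give:

```latex
P(\text{DLJ}) = \frac{0.5 + 0.25 + 0.75}{3} = \frac{1.5}{3} = 0.5
```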

An algorithm for measuring P(DLJ) is shown as follows:

Calculate Data-Local-Job-Probability
Initialize TotalDataNodeSlots to 0
for DataNode[i] in DataNode[0]...DataNode[n−1]
    set TotalDataNodeSlots = TotalDataNodeSlots + DataNode[i].Slots
end for
when Data[x].Block[i] is the input block for Task[i]
    Initialize DataLocalSlots to 0 for all Blocks in Data[x]
    for Block[i] in Block[0]...Block[n−1]
        for DataNode[j] in DataNode[0]...DataNode[n−1]
            if DataNode[j] has DataLocalNode of Data[x].Block[i] then
                set Data[x].Block[i].DataLocalSlots += DataNode[j] empty slots
            end if
        end for
        set Data[x].Block[i].DataLocalMapTaskProbability = Data[x].Block[i].DataLocalSlots / TotalDataNodeSlots
    end for
when Data[x] is the input data for this job
    Initialize Data[x].DataLocalJobProbability to 0
    for Block[i] in Block[0]...Block[n−1]
        set Data[x].DataLocalJobProbability += Data[x].Block[i].DataLocalMapTaskProbability
    end for
    set Data[x].DataLocalJobProbability = Data[x].DataLocalJobProbability / TotalDataBlocks
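The listing above can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the function name, the dictionary-based cluster description, and the node names are assumptions introduced here. It follows the listing literally, so each block's probability is the number of empty slots on the nodes holding that block divided by the total number of slots.

```python
from typing import Dict, List, Set


def data_local_job_probability(
    total_slots: Dict[str, int],
    empty_slots: Dict[str, int],
    block_locations: List[Set[str]],
) -> float:
    """Average, over the input blocks of a job, of the per-block probability.

    total_slots:     map task slots per data node (hypothetical input shape)
    empty_slots:     currently idle map task slots per data node
    block_locations: for each input block, the nodes that store a replica
    """
    total = sum(total_slots.values())  # TotalDataNodeSlots
    if total == 0 or not block_locations:
        return 0.0  # assumption: an empty cluster or job yields probability 0

    per_block = []
    for holders in block_locations:
        # DataLocalSlots: empty slots on the nodes that hold this block
        local_empty = sum(empty_slots.get(node, 0) for node in holders)
        per_block.append(local_empty / total)

    # DataLocalJobProbability: mean over all input blocks (TotalDataBlocks)
    return sum(per_block) / len(per_block)


# Made-up heterogeneous cluster: node "a" has more slots than "b" and "c".
slots = {"a": 4, "b": 2, "c": 2}
idle = {"a": 2, "b": 1, "c": 0}
blocks = [{"a", "b"}, {"c"}]
print(data_local_job_probability(slots, idle, blocks))
```

In this example the first block sees 3 idle data-local slots out of 8 and the second block sees none, so the job-level value is the mean of 3/8 and 0.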

The dynamic data replica creator 122 replicates data in real time using P(DLJ), which is calculated by the DLJP calculator 124.

When all Data-Local-Map-Task-Slots of the map task 130 are performing tasks, the dynamic data replica creator 122 schedules and allocates the map task to Rack-Local-Map-Task-Slots.

The dynamic data replica creator 122 generates a random value r (0<r<1) when the scheduling is performed. When the value r is greater than P(DLJ), the dynamic data replica creator 122 replicates data and updates DataLocalJobProbability and AccessCount.

When the value r is less than P(DLJ), the dynamic data replica creator 122 temporarily replicates a data block and updates AccessCount. An algorithm for dynamically replicating data is shown as follows:

Dynamic Replication logic based on Data-Local-Job-Probability
when a heartbeat is received from datanode
    if a rack-local-map-task is scheduled then
        Generate random number r (0<r<1)
        when Data[x] is the input data for job
            get DataLocalJobProbability of job
            if r > DataLocalJobProbability then
                replicate Data[x]
                increase Data[x].ReplicationFactor
                increase Data[x].AccessCount
                update Data[x].DataLocalJobProbability
            else
                create Cache
                increase Data[x].AccessCount
            end if
    end if
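The decision made in the replication logic above can be sketched in Python as shown below. The `DataEntry` class, the `on_rack_local_schedule` function, and the `recompute_dljp` callback are hypothetical names introduced for illustration. The sketch only updates bookkeeping fields and does not copy any blocks. Note that Python's `random.random()` draws from [0, 1), whereas the listing specifies 0<r<1.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataEntry:
    """Bookkeeping for one data item (hypothetical structure)."""
    replication_factor: int
    access_count: int = 0
    dljp: float = 0.0     # DataLocalJobProbability
    cached: bool = False  # True once a temporary replica (cache) exists


def on_rack_local_schedule(
    data: DataEntry,
    recompute_dljp: Callable[[DataEntry], float],
    rng: Callable[[], float] = random.random,
) -> str:
    """Run when a rack-local map task is scheduled on a heartbeat."""
    r = rng()
    data.access_count += 1  # AccessCount is increased in both branches
    if r > data.dljp:
        # Low data-local probability: create a persistent replica.
        data.replication_factor += 1
        data.dljp = recompute_dljp(data)  # assumed to re-run the DLJP step
        return "replicate"
    # Otherwise keep only a temporary replica of the block.
    data.cached = True
    return "cache"


# Deterministic demonstration with fixed values in place of random draws.
entry = DataEntry(replication_factor=3, dljp=0.4)
first = on_rack_local_schedule(entry, lambda d: 0.6, rng=lambda: 0.9)
second = on_rack_local_schedule(entry, lambda d: 0.6, rng=lambda: 0.1)
print(first, second, entry)
```

Because a replica is created only when r exceeds the job's probability, data whose P(DLJ) is already high is rarely replicated again.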

A Least Recently Frequently Access (LRFA) data eviction method is proposed in order to evict old data that has a low access frequency among the data stored in an HDFS, thereby improving the storage space efficiency of the HDFS.

The replica eviction selector 126 measures a frequency of access to Data[i] and a frequency of access to all data of the HDFS using the LRFA data eviction method.

When the frequency of access to the Data[i] is lower than the frequency of access to all data of the HDFS, the replica eviction selector 126 evicts a data replica.

The frequency of access to the Data[i] is measured using Equation (8) below:

$\text{Data}[i].\text{AccessFrequency} = \frac{\text{Data}[i].\text{AccessCount}}{\text{Data}[i].\text{StoredTime}}$   [Equation (8)]

where Data[i].AccessFrequency denotes the frequency of access to the Data[i], Data[i].StoredTime denotes a storage time of the Data[i], and Data[i].AccessCount denotes the number of accesses to the Data[i].

The frequency of access to all data of the HDFS is measured using Equation (9) below:

$\text{Data Access Frequency of HDFS} = \frac{\text{Total Data Access Count}}{\text{HDFS Running Time}}$   [Equation (9)]

where Data.AccessFrequency of HDFS denotes the frequency of access to all of the data of the HDFS, HDFS RunningTime denotes an operating time of the HDFS, and Total Data Access Count denotes the number of accesses to all of the data.
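The eviction rule described for the replica eviction selector 126, namely Equations (8) and (9) followed by a comparison of the two frequencies, can be sketched in Python as follows. This is an illustrative sketch only: the `DataStats` class, the function names, the use of seconds as the time unit, and the choice not to evict when a time value is zero are assumptions that the disclosure does not specify.

```python
from dataclasses import dataclass


@dataclass
class DataStats:
    """Per-data counters (hypothetical structure)."""
    access_count: int   # Data[i].AccessCount
    stored_time: float  # Data[i].StoredTime, assumed to be in seconds


def access_frequency(d: DataStats) -> float:
    """Equation (8): accesses per unit of stored time."""
    return d.access_count / d.stored_time


def hdfs_access_frequency(total_access_count: int, running_time: float) -> float:
    """Equation (9): accesses to all data per unit of HDFS running time."""
    return total_access_count / running_time


def should_evict(d: DataStats, total_access_count: int, running_time: float) -> bool:
    """Evict a replica when the data is accessed less often than HDFS overall."""
    if d.stored_time <= 0 or running_time <= 0:
        # Assumption: frequencies are undefined here, so nothing is evicted.
        return False
    return access_frequency(d) < hdfs_access_frequency(total_access_count, running_time)


# Made-up numbers: HDFS has served 1000 accesses over 10000 seconds.
cold = DataStats(access_count=2, stored_time=1000.0)
hot = DataStats(access_count=500, stored_time=1000.0)
print(should_evict(cold, 1000, 10000.0), should_evict(hot, 1000, 10000.0))
```

With these numbers the HDFS-wide frequency is 0.1 accesses per second, so the rarely accessed item (0.002) is a candidate for eviction and the frequently accessed one (0.5) is kept.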

According to the above-described configuration, the present disclosure can improve data replication performance by using the map task slots of a data node as the criterion for the performance of each node in a heterogeneous Hadoop.

The present disclosure can improve the performance of MapReduce by applying dynamic data replication in a heterogeneous Hadoop.

The above-described embodiment of the present disclosure is not limited to implementation as an apparatus and/or a method. It can also be implemented through a program that realizes functions corresponding to the configuration of the embodiment and through a recording medium having the program recorded thereon. Such an implementation can be readily made by those skilled in the art to which the present disclosure pertains from the above description of the embodiment.

Although the embodiment of the present disclosure has been described in detail, the scope of the present disclosure is not limited thereto. Modifications and alterations made by those skilled in the art using the basic concept of the present disclosure defined in the following claims also fall within the scope of the present disclosure.

What is claimed is:
 1. A system for dynamically replicating data for heterogeneous Hadoop, the system comprising a name node including a replication manager, wherein the replication manager calculates a probability that a map task is allocated to a map task slot of a data node that stores an input data block of the map task out of a plurality of map task slots of data nodes by using a number of the map task slots in Hadoop clusters comprising heterogeneous clusters, and dynamically replicates data based on the probability.
 2. The system of claim 1, wherein the map manager comprises a dynamic data replica creator classifying a first map task slot of Data-Local-Map-Task-Slot of the data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task, and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the following equation:
P(DLMT)=P(Data Local Slots|Empty Slots)   [Equation]
where P(DLMT) denotes the probability to select Data Local Slots, each of which is the map task slot of the data node that stores the input data of the map task in the data node, out of Empty Slots, which are the map task slots that do not perform the task.
 3. The system of claim 1, wherein the map manager includes a dynamic data replica creator classifying a first map task slot of Data-Local-Map-Task-Slot of a data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task, and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the Equation 1 below:
$P(\text{DLMT}) = \frac{P(\text{Empty Slots} \mid \text{Data Local Slots}) \times P(\text{Data Local Slots})}{P(\text{Empty Slots})} = \frac{\frac{n(\text{Empty Slots})}{n(\text{Data Local Slots})} \times \frac{n(\text{Data Local Slots})}{n(\text{Total Slots})}}{\frac{n(\text{Total Empty Slots})}{n(\text{Total Slots})}}$   [Equation 1]
where P(Empty Slots|Data Local Slots) denotes a probability to select Empty Slots, each of which is the map task slot that does not perform the task, out of Data Local Slots, which are map task slots of data nodes that store the input data of the map task in the plurality of data nodes, wherein: P(Empty Slots|Data Local Slots) is obtained by the Equation 2 below:
$P(\text{Empty Slots} \mid \text{Data Local Slots}) = \frac{n(\text{Empty Slots})}{n(\text{Data Local Slots})}$   [Equation 2]
where n(Data Local Slots) denotes the number of map task slots of the data nodes that store the input data of the map task out of the plurality of the data nodes, and n(Empty Slots) denotes the number of task slots that do not perform tasks in the data node; P(Data Local Slots) denotes a probability to select Data Local Slots out of a total number of map task slots of the data nodes, and P(Data Local Slots) is obtained by the Equation 3 below:
$P(\text{Data Local Slots}) = \frac{n(\text{Data Local Slots})}{n(\text{Total Slots})}$;   [Equation 3]
P(Empty Slots) denotes a probability to select the map task slot that does not perform the task out of the plurality of task slots of the data nodes, and P(Empty Slots) is obtained by the Equation 4 below:
$P(\text{Empty Slots}) = \frac{n(\text{Total Empty Slots})}{n(\text{Total Slots})}$;   [Equation 4]
Total Slots is the total number of map task slots of the data nodes, and n(Total Slots) is obtained by the Equation 5 below:
n(Total Slots)=n(Data Local Slots)+n(Rack Local Slots)   [Equation 5]
where n(Data Local Slots) denotes the number of map task slots of the data nodes that store the input data of the map task out of the plurality of the data nodes, and n(Rack Local Slots) is the number of map task slots of data nodes that do not store the input data of the map task out of the plurality of the data nodes; and Empty Slots are slots that do not perform tasks in a data node, and n(Total Empty Slots) is obtained by the Equation 6 below:
n(Total Empty Slots)=n(Data Local Empty Slots)+n(Rack Local Empty Slots)   [Equation 6]
where n(Data Local Empty Slots) denotes a number of empty slots that do not perform tasks among the map task slots of the data nodes that store the input data of the map task in the plurality of the data nodes, and n(Rack Local Empty Slots) denotes a number of empty slots that do not perform tasks among the map task slots of the data nodes that do not store the input data of the map task in the plurality of the data nodes.
 4. The system of claim 2, wherein the map manager comprises a Data Local Job Probability (DLJP) calculator calculating a probability P(DLJ) that a map disk Data-Local-Map-Task (DLMT), which is allocated to the first map task slot of Data-Local-Map-Task-Slot out of the plurality of map tasks, occurs in a job having an i-number of map tasks, and wherein the probability P(DLJ) is calculated by using the Equation 7 below:
$P(\text{DLJ}) = \frac{P(\text{DLMT}_{[1]} + \text{DLMT}_{[2]} + \text{DLMT}_{[3]} + \ldots + \text{DLMT}_{[i]})}{n(\text{Total Map Task})} = \frac{\sum_{i=0}^{n} P(\text{DLMT}_{[i]})}{n(\text{Total Map Task})}$   [Equation 7]
where n(Total Map Task) denotes a total number of the map tasks of the plurality of the data nodes.
 5. The system of claim 3, wherein the map manager comprises a Data Local Job Probability (DLJP) calculator calculating a probability P(DLJ) that a map disk Data-Local-Map-Task (DLMT), which is allocated to the first map task slot of Data-Local-Map-Task-Slot out of the plurality of map tasks, occurs in a job having an i-number of map tasks, and wherein the probability P(DLJ) is calculated by using the Equation 7 below:
$P(\text{DLJ}) = \frac{P(\text{DLMT}_{[1]} + \text{DLMT}_{[2]} + \text{DLMT}_{[3]} + \ldots + \text{DLMT}_{[i]})}{n(\text{Total Map Task})} = \frac{\sum_{i=0}^{n} P(\text{DLMT}_{[i]})}{n(\text{Total Map Task})}$   [Equation 7]
where n(Total Map Task) denotes a total number of the map tasks of the plurality of the data nodes.
 6. The system of claim 4, wherein the dynamic data replica creator replicates data in real time using the probability P(DLJ), and schedules and allocates the map task to the second map task slot of Rack-Local-Map-Task-Slot when all of the first map task slots of Data-Local-Map-Task-Slots of the map task are performing tasks.
 7. The system of claim 1, wherein the replication manager comprises a replica eviction selector: measuring a frequency of access to Data[i] using the Equation 8 below; measuring a frequency of access to all data of a Hadoop Distributed File System (HDFS) using the Equation 9 below; and evicting a data replica when the frequency of access to the Data[i] is lower than the frequency of access to all data of the HDFS, wherein the Equation 8 is
$\text{Data}[i].\text{AccessFrequency} = \frac{\text{Data}[i].\text{AccessCount}}{\text{Data}[i].\text{StoredTime}}$
where Data[i].AccessFrequency denotes the frequency of access to the Data[i], Data[i].StoredTime denotes a storage time of the Data[i], Data[i].AccessCount denotes a number of accesses to the Data[i], and i is a number of tasks; and Equation 9 is
$\text{Data Access Frequency of HDFS} = \frac{\text{Total Data Access Count}}{\text{HDFS Running Time}}$
where Data.AccessFrequency of HDFS is the frequency of access to all data of the HDFS, HDFS RunningTime is an operating time of the HDFS, and Total Data Access Count is a number of accesses to all data of the HDFS.
 8. A method of dynamically replicating data for heterogeneous Hadoop, the method comprising: calculating a probability that a map task is allocated to a map task slot of a data node that stores an input data block of the map task out of a plurality of map task slots of data nodes by using a number of the map task slots in Hadoop clusters comprising heterogeneous clusters; and dynamically replicating data based on the probability.
 9. The method of claim 8, wherein the step of the calculating of the probability further comprises: classifying a first map task slot of Data-Local-Map-Task-Slot of the data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task; and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the Equation 1:
P(DLMT)=P(Data Local Slots|Empty Slots)   [Equation 1]
where P(DLMT) denotes the probability to select Data Local Slots, each of which is the map task slot of the data node that stores the input data of the map task in the data node, out of Empty Slots, which are the map task slots that do not perform the task.
 10. The method of claim 9, wherein the step of the calculating of the probability using Bayesian theory comprises: calculating a probability P(DLJ) that a map disk Data-Local-Map-Task (DLMT), which is allocated to the first map task slot of Data-Local-Map-Task-Slot out of the plurality of map tasks, occurs in a job having an i-number of map tasks, and wherein the probability P(DLJ) is calculated by using the Equation 2 below:
$P(\text{DLJ}) = \frac{P(\text{DLMT}_{[1]} + \text{DLMT}_{[2]} + \text{DLMT}_{[3]} + \ldots + \text{DLMT}_{[i]})}{n(\text{Total Map Task})} = \frac{\sum_{i=0}^{n} P(\text{DLMT}_{[i]})}{n(\text{Total Map Task})}$   [Equation 2]
where n(Total Map Task) denotes a total number of the map tasks of the plurality of the data nodes.
 11. An apparatus for dynamically replicating data for heterogeneous Hadoop, the apparatus comprising a name node including a replication manager, wherein the replication manager calculates a probability that a map task is allocated to a map task slot of a data node that stores an input data block of the map task out of a plurality of map task slots of data nodes by using a number of the map task slots in Hadoop clusters comprising heterogeneous clusters, and dynamically replicates data based on the probability.
 12. The apparatus of claim 11, wherein the map manager comprises a dynamic data replica creator classifying a first map task slot of Data-Local-Map-Task-Slot of the data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task, and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the following equation:
P(DLMT)=P(Data Local Slots|Empty Slots)   [Equation]
where P(DLMT) denotes the probability to select Data Local Slots, each of which is the map task slot of the data node that stores the input data of the map task in the data node, out of Empty Slots, which are the map task slots that do not perform the task.
 13. The apparatus of claim 11, wherein the map manager includes a dynamic data replica creator classifying a first map task slot of Data-Local-Map-Task-Slot of a data node that stores the input data block of the map task and a second map task slot of Rack-Local-Map-Task-Slot of a data node that does not store the input data block of the map task, and calculating the probability that the map task is allocated to the map task slot of the data node that does not perform a task and stores the input data block of the map task out of the plurality of map task slots of the data nodes using Bayesian theory and the Equation 1 below:
$P(\text{DLMT}) = \frac{P(\text{Empty Slots} \mid \text{Data Local Slots}) \times P(\text{Data Local Slots})}{P(\text{Empty Slots})} = \frac{\frac{n(\text{Empty Slots})}{n(\text{Data Local Slots})} \times \frac{n(\text{Data Local Slots})}{n(\text{Total Slots})}}{\frac{n(\text{Total Empty Slots})}{n(\text{Total Slots})}}$   [Equation 1]
where P(Empty Slots|Data Local Slots) denotes a probability to select Empty Slots, each of which is the map task slot that does not perform the task, out of Data Local Slots, which are map task slots of data nodes that store the input data of the map task in the plurality of data nodes, wherein: P(Empty Slots|Data Local Slots) is obtained by the Equation 2 below:
$P(\text{Empty Slots} \mid \text{Data Local Slots}) = \frac{n(\text{Empty Slots})}{n(\text{Data Local Slots})}$   [Equation 2]
where n(Data Local Slots) denotes the number of map task slots of the data nodes that store the input data of the map task out of the plurality of the data nodes, and n(Empty Slots) denotes the number of task slots that do not perform tasks in the data node; P(Data Local Slots) denotes a probability to select Data Local Slots out of a total number of map task slots of the data nodes, and P(Data Local Slots) is obtained by the Equation 3 below:
$P(\text{Data Local Slots}) = \frac{n(\text{Data Local Slots})}{n(\text{Total Slots})}$;   [Equation 3]
P(Empty Slots) denotes a probability to select the map task slot that does not perform the task out of the plurality of task slots of the data nodes, and P(Empty Slots) is obtained by the Equation 4 below:
$P(\text{Empty Slots}) = \frac{n(\text{Total Empty Slots})}{n(\text{Total Slots})}$;   [Equation 4]
Total Slots is the total number of map task slots of the data nodes, and n(Total Slots) is obtained by the Equation 5 below:
n(Total Slots)=n(Data Local Slots)+n(Rack Local Slots)   [Equation 5]
where n(Data Local Slots) denotes the number of map task slots of the data nodes that store the input data of the map task out of the plurality of the data nodes, and n(Rack Local Slots) is the number of map task slots of data nodes that do not store the input data of the map task out of the plurality of the data nodes; and Empty Slots are slots that do not perform tasks in a data node, and n(Total Empty Slots) is obtained by the Equation 6 below:
n(Total Empty Slots)=n(Data Local Empty Slots)+n(Rack Local Empty Slots)   [Equation 6]
where n(Data Local Empty Slots) denotes a number of empty slots that do not perform tasks among the map task slots of the data nodes that store the input data of the map task in the plurality of the data nodes, and n(Rack Local Empty Slots) denotes a number of empty slots that do not perform tasks among the map task slots of the data nodes that do not store the input data of the map task in the plurality of the data nodes.
 14. The apparatus of claim 12, wherein the map manager comprises a Data Local Job Probability (DLJP) calculator calculating a probability P(DLJ) that a map disk Data-Local-Map-Task (DLMT), which is allocated to the first map task slot of Data-Local-Map-Task-Slot out of the plurality of map tasks, occurs in a job having an i-number of map tasks, and wherein the probability P(DLJ) is calculated by using the Equation 7 below:
$P(\text{DLJ}) = \frac{P(\text{DLMT}_{[1]} + \text{DLMT}_{[2]} + \text{DLMT}_{[3]} + \ldots + \text{DLMT}_{[i]})}{n(\text{Total Map Task})} = \frac{\sum_{i=0}^{n} P(\text{DLMT}_{[i]})}{n(\text{Total Map Task})}$   [Equation 7]
where n(Total Map Task) denotes a total number of the map tasks of the plurality of the data nodes.
 15. The apparatus of claim 13, wherein the map manager comprises a Data Local Job Probability (DLJP) calculator calculating a probability P(DLJ) that a map disk Data-Local-Map-Task (DLMT), which is allocated to the first map task slot of Data-Local-Map-Task-Slot out of the plurality of map tasks, occurs in a job having an i-number of map tasks, and wherein the probability P(DLJ) is calculated by using the Equation 7 below:
$P(\text{DLJ}) = \frac{P(\text{DLMT}_{[1]} + \text{DLMT}_{[2]} + \text{DLMT}_{[3]} + \ldots + \text{DLMT}_{[i]})}{n(\text{Total Map Task})} = \frac{\sum_{i=0}^{n} P(\text{DLMT}_{[i]})}{n(\text{Total Map Task})}$   [Equation 7]
where n(Total Map Task) denotes a total number of the map tasks of the plurality of the data nodes.
 16. The apparatus of claim 14, wherein the dynamic data replica creator replicates data in real time using the probability P(DLJ), and schedules and allocates the map task to the second map task slot of Rack-Local-Map-Task-Slot when all of the first map task slots of Data-Local-Map-Task-Slots of the map task are performing tasks.
 17. The apparatus of claim 11, wherein the replication manager comprises a replica eviction selector: measuring a frequency of access to Data[i] using the Equation 8 below; measuring a frequency of access to all data of a Hadoop Distributed File System (HDFS) using the Equation 9 below; and evicting a data replica when the frequency of access to the Data[i] is lower than the frequency of access to all data of the HDFS, wherein the Equation 8 is
$\text{Data}[i].\text{AccessFrequency} = \frac{\text{Data}[i].\text{AccessCount}}{\text{Data}[i].\text{StoredTime}}$
where Data[i].AccessFrequency denotes the frequency of access to the Data[i], Data[i].StoredTime denotes a storage time of the Data[i], Data[i].AccessCount denotes a number of accesses to the Data[i], and i is a number of tasks; and Equation 9 is
$\text{Data Access Frequency of HDFS} = \frac{\text{Total Data Access Count}}{\text{HDFS Running Time}}$
where Data.AccessFrequency of HDFS is the frequency of access to all data of the HDFS, HDFS RunningTime is an operating time of the HDFS, and Total Data Access Count is a number of accesses to all data of the HDFS.